About the dataset
Name: This column contains the full name of the athlete participating in the Olympic Games.
Sex: This column indicates the gender of the athlete. It has two unique values: "M" for male and "F" for female.
Age: This column represents the age of the athlete at the time of the competition.
Team: The name of the team or delegation that the athlete represents at the Olympics.
NOC: The three-letter code identifying the athlete's National Olympic Committee (NOC), for example "IND" or "ESP".
Year: Represents the year in which the Olympic Games took place.
Season: Indicates whether the Olympic Games occurred in the "Summer" or "Winter" season. This distinction is important because different sports are played in each season.
City: The host city where the Olympic event took place. This information can be useful for analyzing the impact of location and climate conditions on athlete performance.
Sport: Represents the broad category of the sport in which the athlete competed (e.g., Athletics, Swimming, Gymnastics).
Event: The specific event within a sport in which the athlete participated (e.g., "100m Sprint", "Long Jump").
Medal: Indicates the type of medal won by the athlete. Possible values are "Gold", "Silver" and "Bronze"; the value is missing (NaN) if no medal was won.
Country: This column represents the full country name corresponding to the NOC code.
Height: The height of the athlete in centimetres.
Weight: The weight of the athlete in kilograms.
Import the Libraries and Load the Data
#Import the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import OneHotEncoder
#Loading the file
df = pd.read_csv('Athletes_summer_games.csv')
df_athlete = pd.read_csv('Olympic_Athlete_Biography.csv')
df_region = pd.read_csv('Olympic_Country_Profiles.csv')
df.sample(1)
| | Unnamed: 0 | Name | Sex | Age | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95483 | 115751 | Mariappa Kempaiah | M | 23.0 | India | IND | 1956 Summer | 1956 | Summer | Melbourne | Football | Football Men's Football | NaN |
# Rename columns so they match the naming used in the main dataset
df_region.rename(columns={'noc':'NOC','country':'Country'},inplace = True)
df_region.sample(1)
| | NOC | Country |
|---|---|---|
| 39 | CHA | Chad |
# Merging Country in main dataset
df = df.merge(df_region,on='NOC',how = 'left')
# dropping unnecessary column
df = df.drop(columns=['Unnamed: 0','Games'])
df.sample(1)
| | Name | Sex | Age | Team | NOC | Year | Season | City | Sport | Event | Medal | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 169486 | Alberto Ruiz Benito | M | 30.0 | Spain | ESP | 1992 | Summer | Barcelona | Athletics | Athletics Men's Pole Vault | NaN | Spain |
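The left merge on NOC above keeps every athlete row, but any code missing from the lookup table silently gets a missing Country. A minimal sketch of how this could be checked, using toy stand-ins for `df` and `df_region` (the data and the code "XXX" are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the notebook's df and df_region (illustrative values only).
athletes = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "NOC": ["IND", "ESP", "XXX"],  # "XXX" is a made-up code with no match
})
regions = pd.DataFrame({
    "NOC": ["IND", "ESP"],
    "Country": ["India", "Spain"],
})

# validate="m:1" raises if a NOC appears more than once in the lookup table;
# indicator=True adds a _merge column recording where each row came from.
merged = athletes.merge(regions, on="NOC", how="left",
                        validate="m:1", indicator=True)

# NOC codes that found no partner in the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "NOC"].unique()
print(unmatched)
```

Listing the unmatched codes before moving on makes it easier to decide whether to extend the lookup table or leave those countries blank.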
# Drop unneeded columns and rename the rest to match the main dataset
df_athlete = df_athlete.drop(columns=['athlete_id','sex','born','country','country_noc','description','special_notes'])
df_athlete.rename(columns={'name':'Name','height':'Height','weight':'Weight'},inplace=True)
df_athlete.sample(1)
| | Name | Height | Weight |
|---|---|---|---|
| 12502 | Ellie Faulkner | 165.0 | 68.0 |
# Merging Height & Weight in main dataset
df = df.merge(df_athlete,on='Name',how='left')
df.sample(3)
| | Name | Sex | Age | Team | NOC | Year | Season | City | Sport | Event | Medal | Country | Height | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41641 | Natasha De'Anka "Tasha" Danvers (-Smith-) | F | 22.0 | Great Britain | GBR | 2000 | Summer | Sydney | Athletics | Athletics Women's 4 x 400 metres Relay | NaN | Great Britain | NaN | NaN |
| 137962 | Ibtihaj Muhammad | F | 30.0 | United States | USA | 2016 | Summer | Rio de Janeiro | Fencing | Fencing Women's Sabre, Team | Bronze | United States | 170.0 | 68.0 |
| 226066 | ABDEL LATIF Radwa | F | 31.0 | Egypt | EGY | 2020 | Summer | Tokyo | Shooting | 10m Air Pistol Mixed Team | NaN | Egypt | NaN | NaN |
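Merging on Name alone carries a risk worth illustrating: if two different athletes share a name in the biography table, every result row for that name is matched to both, which multiplies rows and mixes up heights and weights. The sketch below uses invented data to show the effect and one possible hedge (dropping ambiguous names before the merge); it is not the approach the notebook takes.

```python
import pandas as pd

# Invented data: two different people in the biography table share one name.
results = pd.DataFrame({
    "Name": ["Alex Kim", "Alex Kim", "Sam Lee"],
    "Event": ["100 m", "200 m", "Long Jump"],
})
bios = pd.DataFrame({
    "Name": ["Alex Kim", "Alex Kim", "Sam Lee"],
    "Height": [170.0, 182.0, 175.0],
    "Weight": [60.0, 80.0, 68.0],
})

# Naive merge: each "Alex Kim" result row matches both biography rows.
naive = results.merge(bios, on="Name", how="left")
print(len(results), "->", len(naive))  # 3 -> 5

# One possible hedge: keep only names that are unique in the biography table,
# and let validate="m:1" raise if a duplicate slips through.
unique_bios = bios.drop_duplicates(subset="Name", keep=False)
safe = results.merge(unique_bios, on="Name", how="left", validate="m:1")
print(len(safe))  # 3
```

The trade-off is that ambiguous names end up with missing Height and Weight rather than possibly wrong values.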
Understanding of the data
# Dimensions of the data
df.shape
(241723, 14)
#Information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 241723 entries, 0 to 241722 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 241723 non-null object 1 Sex 241723 non-null object 2 Age 232357 non-null float64 3 Team 241723 non-null object 4 NOC 241723 non-null object 5 Year 241723 non-null int64 6 Season 241723 non-null object 7 City 241723 non-null object 8 Sport 241723 non-null object 9 Event 241723 non-null object 10 Medal 37196 non-null object 11 Country 241469 non-null object 12 Height 68601 non-null float64 13 Weight 68601 non-null float64 dtypes: float64(3), int64(1), object(10) memory usage: 25.8+ MB
# Convert Country to string. Note: missing values become the literal text 'nan', so they no longer count as null.
df['Country'] = df['Country'].astype(str)
#Information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 241723 entries, 0 to 241722 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 241723 non-null object 1 Sex 241723 non-null object 2 Age 232357 non-null float64 3 Team 241723 non-null object 4 NOC 241723 non-null object 5 Year 241723 non-null int64 6 Season 241723 non-null object 7 City 241723 non-null object 8 Sport 241723 non-null object 9 Event 241723 non-null object 10 Medal 37196 non-null object 11 Country 241723 non-null object 12 Height 68601 non-null float64 13 Weight 68601 non-null float64 dtypes: float64(3), int64(1), object(10) memory usage: 25.8+ MB
Interpretation: The dataset has 241,723 rows and 14 columns of Olympic athlete data spanning multiple Games. Each row records an athlete's Name, Sex, Age, Team, NOC and Country, the Sport and Event they competed in, and the Medal won, if any. Age, Height, Weight and Medal contain missing values, and Country had 254 missing entries before it was converted to string.
# Missing values
df.isnull().sum()
Name 0 Sex 0 Age 9366 Team 0 NOC 0 Year 0 Season 0 City 0 Sport 0 Event 0 Medal 204527 Country 0 Height 173122 Weight 173122 dtype: int64
Interpretation: Four columns have missing values: Age (9,366), Medal (204,527), Height (173,122) and Weight (173,122). For Age, Height and Weight the values were not recorded or are unavailable, which limits any analysis of athlete physique and performance. Medal is different: a missing value means the athlete did not win a medal, so the large NaN count simply reflects that most participants do not medal. Country reports zero missing values only because it was cast to string earlier, which turned its 254 missing entries into the text 'nan'.
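The effect of the earlier `astype(str)` call on the Country column can be reproduced on a tiny Series. This is a sketch on made-up values; the behavior described is what pandas 1.x and 2.x do (the line used in this notebook), and newer releases may treat missing values differently. Filling with an explicit label such as "Unknown" is shown as one possible alternative, not as what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up values: one country is missing.
country = pd.Series(["India", np.nan, "Spain"], name="Country")

# In pandas 1.x/2.x the missing value becomes the literal string "nan".
as_str = country.astype(str)

missing_before = country.isna().sum()   # missing entries in the original
missing_after = as_str.isna().sum()     # missing entries after the cast
hidden = (as_str == "nan").sum()        # entries now disguised as text

print(missing_before, missing_after, hidden)

# Possible alternative: make the gap explicit instead of hiding it.
labelled = country.fillna("Unknown")
print(labelled.tolist())
```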
# Check for duplicate rows
df.duplicated().sum()
1765
Interpretation: The dataset contains 1,765 fully duplicated rows. Left in place, they would double-count athletes and medals in later analysis.
# Drop the duplicate rows and confirm that none remain
df.drop_duplicates(inplace=True)
df.duplicated().sum()
0
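Before dropping duplicates it can help to look at them, since `duplicated()` by default flags only the second and later copies. A small sketch on invented rows, showing how `keep=False` surfaces every copy and how the index could be reset after the drop (the notebook itself keeps the original index):

```python
import pandas as pd

# Invented rows: the first two are exact copies; the last two differ by Event.
rows = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "Event": ["100 m", "100 m", "Archery Individual", "Archery Mixed Team"],
    "Year": [2020, 2020, 2020, 2020],
})

# Default keep="first": only the extra copies are flagged.
n_extra = rows.duplicated().sum()

# keep=False flags every member of a duplicated group, useful for inspection.
all_copies = rows[rows.duplicated(keep=False)]

# Drop the extras and renumber the index so it has no gaps.
deduped = rows.drop_duplicates().reset_index(drop=True)

print(n_extra, len(all_copies), len(deduped))
```

Rows that share a name but differ in Event are not duplicates and are kept.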
# Count each medal type
df['Medal'].value_counts()
Medal Gold 12459 Bronze 12436 Silver 12240 Name: count, dtype: int64
Interpretation: The three medal counts are close: 12,459 Gold, 12,436 Bronze and 12,240 Silver. Gold is marginally ahead, but the gap between the largest and smallest count is only 219, so medals are distributed almost evenly across the three types.
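How even the split is can be made concrete by turning the counts printed above into shares. The sketch below only reuses the three numbers from the `value_counts()` output:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
medal_counts = pd.Series({"Gold": 12459, "Bronze": 12436, "Silver": 12240})

# Share of each medal type among all medal rows.
shares = medal_counts / medal_counts.sum()

# Gap between the most and least common medal type.
spread = medal_counts.max() - medal_counts.min()

print(shares.round(4))
print("spread:", spread)
```

Each share lands within about one percentage point of one third, which supports the "nearly even" reading.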
# Apply one-hot encoding to Medal and concatenate the result to the main dataset
df = pd.concat([df,pd.get_dummies(df['Medal']).astype(int)],axis=1)
# Keep the Country column consistent
Country_Mapping={'ROC':'Russian Olympic Committee'}
df['Country'] = df['Country'].replace(Country_Mapping)
df
| | Name | Sex | Age | Team | NOC | Year | Season | City | Sport | Event | Medal | Country | Height | Weight | Bronze | Gold | Silver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A Dijiang | M | 24.0 | China | CHN | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN | People's Republic of China | NaN | NaN | 0 | 0 | 0 |
| 1 | A Lamusi | M | 23.0 | China | CHN | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN | People's Republic of China | 170.0 | 60.0 | 0 | 0 | 0 |
| 2 | Gunnar Nielsen Aaby | M | 24.0 | Denmark | DEN | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN | Denmark | NaN | NaN | 0 | 0 | 0 |
| 3 | Edgar Lindenau Aabye | M | 34.0 | Denmark/Sweden | DEN | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN | NaN | 0 | 1 | 0 |
| 4 | Cornelia "Cor" Aalten (-Strannood) | F | 18.0 | Netherlands | NED | 1932 | Summer | Los Angeles | Athletics | Athletics Women's 100 metres | NaN | Netherlands | NaN | NaN | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 241718 | ZYKOVA Yulia | F | 25.0 | Russia | ROC | 2020 | Summer | Tokyo | Shooting | 50m Rifle 3 Positions Women | Silver | Russian Olympic Committee | NaN | NaN | 0 | 0 | 1 |
| 241719 | ZYUZINA Ekaterina | F | 24.0 | Russia | ROC | 2020 | Summer | Tokyo | Sailing | Women's One Person Dinghy - Laser Radial | NaN | Russian Olympic Committee | NaN | NaN | 0 | 0 | 0 |
| 241720 | ZYUZINA Ekaterina | F | 24.0 | Russia | ROC | 2020 | Summer | Tokyo | Sailing | Women's One Person Dinghy - Laser Radial | NaN | Russian Olympic Committee | NaN | NaN | 0 | 0 | 0 |
| 241721 | ZYZANSKA Sylwia | F | 24.0 | Poland | POL | 2020 | Summer | Tokyo | Archery | Women's Individual | NaN | Poland | NaN | NaN | 0 | 0 | 0 |
| 241722 | ZYZANSKA Sylwia | F | 24.0 | Poland | POL | 2020 | Summer | Tokyo | Archery | Mixed Team | NaN | Poland | NaN | NaN | 0 | 0 | 0 |
239958 rows × 17 columns
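The one-hot columns added above make medal tallies a simple sum. The following sketch, on a made-up four-row table, shows that rows with a missing Medal become all zeros and how a per-country tally could then be built. Because each row is one athlete in one event, a team medal would be counted once per team member in such a tally.

```python
import numpy as np
import pandas as pd

# Made-up rows: one athlete without a medal, three with different medals.
toy = pd.DataFrame({
    "Country": ["India", "India", "Spain", "Spain"],
    "Medal": ["Gold", np.nan, "Bronze", "Silver"],
})

# Same pattern as the notebook: indicator columns appended to the frame.
dummies = pd.get_dummies(toy["Medal"]).astype(int)
toy = pd.concat([toy, dummies], axis=1)

# Row 1 has no medal, so all three indicator columns are 0.
print(toy)

# Per-country medal tally from the indicator columns.
tally = toy.groupby("Country")[["Gold", "Silver", "Bronze"]].sum()
print(tally)
```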
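# One row per athlete: keep the first record for each (Name, NOC) pair
athlete_df = df.drop_duplicates(subset=['Name','NOC']).copy()
# Assign the result back; inplace=True on a selected column may not update athlete_df
athlete_df['Medal'] = athlete_df['Medal'].fillna('No Medal')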
Sex_Count = athlete_df['Sex'].value_counts().reset_index()
Sex_Count.columns = ['Sex','Count']
fig = px.bar(Sex_Count, x='Sex', y='Count', color='Sex',
             title="Participating Male Female Distribution",
             text='Count')
fig.update_layout(title_x=0.5)
fig.show()
Interpretation: Male athletes outnumber female athletes across the Olympic history covered by this dataset: 94,590 men versus 34,677 women up to the 2020 Games.
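One caveat about the per-athlete table built above: `drop_duplicates(subset=['Name','NOC'])` keeps only the first row for each athlete, so the Medal value that survives is whatever that first row held. This is fine for counting athletes by sex, but it can understate medals. A sketch on invented rows, with a hypothetical `Won` flag as one alternative for medal questions:

```python
import pandas as pd

# Invented rows: athlete A has no medal in the first row but Gold in the second.
rows = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "NOC": ["IND", "IND", "ESP"],
    "Sex": ["F", "F", "M"],
    "Medal": [None, "Gold", None],
})

# Same pattern as the notebook: first row per athlete, then label the gaps.
first_row = rows.drop_duplicates(subset=["Name", "NOC"]).copy()
first_row["Medal"] = first_row["Medal"].fillna("No Medal")
print(first_row)  # athlete A appears with "No Medal"

# Hypothetical alternative: did the athlete win a medal in any row?
ever_medalled = (
    rows.assign(Won=rows["Medal"].notna())
        .groupby(["Name", "NOC"], as_index=False)["Won"].any()
)
print(ever_medalled)
```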
athlete_df = athlete_df.dropna(subset=['Age'])
plt.hist(athlete_df['Age'],bins=5,color='skyblue')
(array([9.3913e+04, 2.7607e+04, 1.4640e+03, 1.2300e+02, 5.0000e+00]), array([10. , 27.4, 44.8, 62.2, 79.6, 97. ]), <BarContainer object of 5 artists>)
Interpretation: With only five bins, each bar spans 17.4 years. The first bin (ages 10 to 27.4) holds 93,913 athletes, about 76% of the total, and the second (27.4 to 44.8) holds 27,607. Counts fall off steeply after that, with only 128 athletes older than 62.2. Competitive participation is therefore heavily skewed toward younger athletes.
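The bin counts and edges returned by `plt.hist` above are enough to quantify the skew. The sketch below only reuses those printed numbers; a finer `bins` value in the original call would give a more detailed picture of the 10 to 30 range.

```python
import numpy as np

# Counts and bin edges copied from the plt.hist output above.
counts = np.array([93913, 27607, 1464, 123, 5])
edges = np.array([10.0, 27.4, 44.8, 62.2, 79.6, 97.0])

# Fraction of athletes falling in each bin.
shares = counts / counts.sum()

for lo, hi, share in zip(edges[:-1], edges[1:], shares):
    print(f"{lo:5.1f} to {hi:5.1f}: {share:6.2%}")

# Athletes in the two oldest bins (older than 62.2).
oldest = counts[3:].sum()
print("older than 62.2:", oldest)
```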
plt.figure(figsize=(10,5))
sns.boxplot(x="Medal", y="Age", data=df, order=['Gold', 'Silver', 'Bronze'], palette="coolwarm")
plt.title("Age Distribution of Medalists")
plt.xlabel("Medal Type")
plt.ylabel("Age")
plt.show()
Interpretation: The boxplot shows that the median age of medalists is similar across Gold, Silver and Bronze, sitting in the mid-20s. Most medal winners are between 20 and 30 years old, with a few outliers extending into the 40s and beyond. This indicates that top performance tends to occur during early adulthood.
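The numbers behind a boxplot (quartiles and median) can also be read off directly with a groupby, which makes a claim such as "similar medians" easy to verify. A sketch on invented ages, not the notebook's data:

```python
import pandas as pd

# Invented ages for a handful of medalists.
toy = pd.DataFrame({
    "Medal": ["Gold", "Gold", "Gold", "Silver", "Silver",
              "Bronze", "Bronze", "Bronze"],
    "Age": [22.0, 25.0, 31.0, 24.0, 26.0, 21.0, 27.0, 45.0],
})

# Quartiles per medal type: the same statistics a boxplot draws.
summary = toy.groupby("Medal")["Age"].quantile([0.25, 0.5, 0.75]).unstack()

# Put the rows in podium order rather than alphabetical order.
summary = summary.reindex(["Gold", "Silver", "Bronze"])
print(summary)
```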
x1 = athlete_df['Age'].dropna().tolist()
x2 = athlete_df[athlete_df['Medal'] == 'Gold']['Age'].dropna().tolist()
x3 = athlete_df[athlete_df['Medal'] == 'Silver']['Age'].dropna().tolist()
x4 = athlete_df[athlete_df['Medal'] == 'Bronze']['Age'].dropna().tolist()
fig = ff.create_distplot([x1,x2,x3,x4],['Overall Age', 'Gold Medalist', 'Silver Medalist', 'Bronze Medalist'],show_hist=False,show_rug=False)
fig.update_layout(
    title="Age Analysis of Olympic Medalists Across Different Categories",
    title_x=0.45,
    title_y=0.85)
fig.show()
Interpretation: The distribution plot shows that most Olympic medalists, regardless of medal type, tend to be in their early to mid-20s, with peak density around age 24. Gold medalists have a slightly sharper peak, suggesting a tighter age range for top performance. Overall, age trends are quite similar across all medal categories.
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
sns.histplot(athlete_df['Height'],bins=30,kde=True,color='blue')
plt.title('Height Distribution of athletes')
plt.subplot(1,2,2)
sns.histplot(athlete_df['Weight'],bins=30,kde=True,color='green')
plt.title('Weight Distribution of athletes')
plt.show()
Interpretation: The height distribution of athletes is roughly normal, with most athletes falling between 165 cm and 185 cm. The weight distribution is slightly right-skewed, indicating a higher concentration around 65–75 kg but with a longer tail toward heavier weights. Both distributions show typical ranges for elite athletes.
plt.figure(figsize=(8, 6))
sns.scatterplot(data=athlete_df, x="Weight", y="Height",hue='Sex',alpha=0.5)
plt.title("Height vs. Weight of Athletes")
plt.xlabel("Weight (kg)")
plt.ylabel("Height (cm)")
plt.show()
Interpretation: The scatter plot shows a positive correlation between height and weight among athletes, with males generally being taller and heavier than females. There's a clear clustering, with male athletes occupying the upper-right region and female athletes the lower-left, reflecting typical physiological differences.
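The "positive correlation" seen in the scatter plot can be expressed as a number with the Pearson coefficient, both overall and within each sex. The sketch below uses invented measurements, so the printed values say nothing about the real dataset.

```python
import pandas as pd

# Invented heights (cm) and weights (kg) for eight athletes.
toy = pd.DataFrame({
    "Sex": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "Height": [170.0, 178.0, 185.0, 192.0, 155.0, 162.0, 168.0, 175.0],
    "Weight": [65.0, 74.0, 82.0, 95.0, 48.0, 55.0, 60.0, 70.0],
})

# Pearson correlation across all athletes.
overall_r = toy["Height"].corr(toy["Weight"])

# Pearson correlation computed separately for each sex.
by_sex_r = toy.groupby("Sex")[["Height", "Weight"]].apply(
    lambda g: g["Height"].corr(g["Weight"])
)

print(round(overall_r, 3))
print(by_sex_r.round(3))
```

Comparing the pooled value with the per-sex values shows how much of the overall trend comes from the difference between the two groups rather than from the relationship within each group.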
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
sns.boxplot(athlete_df,x='Sex',y='Height',palette='pastel')
plt.title('Height Distribution by Genders')
plt.subplot(1,2,2)
sns.boxplot(athlete_df,x='Sex',y='Weight',palette='muted')
plt.title('Weight Distribution by Genders')
plt.show()
Interpretation: The box plots show that male athletes generally have greater height and weight compared to female athletes. Both distributions exhibit some outliers, but the median height and weight are clearly higher for males, reflecting typical physiological differences between the genders.
medal_order = ['Gold', 'Silver', 'Bronze']
# .copy() so the column assignment below modifies an independent frame
medalists_df = athlete_df[athlete_df['Medal'].isin(medal_order)].copy()
medalists_df['Medal'] = pd.Categorical(medalists_df['Medal'], categories=medal_order, ordered=True)
fig = make_subplots(rows=1, cols=2, subplot_titles=("Height Distribution of Medalist", "Weight Distribution of Medalist"))
medal_colors = {'Gold': 'gold', 'Silver': 'silver', 'Bronze': 'brown'}
# Left panel: one height box per medal type
for medal in medal_order:
    temp_df = medalists_df[medalists_df['Medal'] == medal]
    fig.add_trace(go.Box(y=temp_df["Height"], name=f"{medal} Medal", marker_color=medal_colors[medal]), row=1, col=1)
# Right panel: one weight box per medal type
for medal in medal_order:
    temp_df = medalists_df[medalists_df['Medal'] == medal]
    fig.add_trace(go.Box(y=temp_df["Weight"], name=f"{medal} Medal", marker_color=medal_colors[medal]), row=1, col=2)
fig.update_layout(title_text="Height & Weight Distributions of Medalists (Ordered by Medal Type)", title_x=0.5, showlegend=True)
fig.show()
Interpretation: The box plots show that gold medalists tend to be slightly taller and heavier on average compared to silver and bronze medalists. While the distributions overlap, gold medalists have a higher median in both height and weight, suggesting a potential physical advantage in certain sports.
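A pooled comparison like the one above mixes sports with very different body types, so a gap between medal types may come from which sports contribute more rows to each medal rather than from a physical advantage. The sketch below uses invented heights to show how a pooled median can differ sharply while the within-sport medians barely move.

```python
import pandas as pd

# Invented data: Basketball contributes mostly Gold rows, Gymnastics mostly Silver.
toy = pd.DataFrame({
    "Sport": ["Basketball"] * 4 + ["Gymnastics"] * 4,
    "Medal": ["Gold", "Gold", "Gold", "Silver",
              "Gold", "Silver", "Silver", "Silver"],
    "Height": [200.0, 202.0, 198.0, 201.0,
               160.0, 159.0, 161.0, 160.0],
})

# Pooled across sports: Gold looks much taller than Silver.
pooled = toy.groupby("Medal")["Height"].median()
print(pooled)

# Within each sport the medians are nearly identical.
within = toy.groupby(["Sport", "Medal"])["Height"].median().unstack()
print(within)
```

Grouping by Sport (or Event) before comparing medal types is one way to check whether the pattern holds inside individual sports.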
plt.figure(figsize=(8,8))
selected_sport = 'Weightlifting'
temp_df = athlete_df[athlete_df['Sport']== selected_sport]
sns.scatterplot(temp_df,x='Weight',y='Height',hue='Medal',style='Sex',s=100)
plt.title(f'Height vs Weight Distribution in {selected_sport}', fontsize=14)
Text(0.5, 1.0, 'Height vs Weight Distribution in Weightlifting')
Interpretation: In weightlifting, medalists (especially gold and bronze) are spread across a range of weights, but tend to cluster in the mid-to-high weight categories. Both male and female athletes compete across the height-weight spectrum, though males (marked with "x") dominate the higher weight and height ranges.
fig = px.scatter(df, x="Weight", y="Height", color="Sport",
                 title="Height vs Weight of Athletes by Sports",
                 hover_data=['Name', 'Sex', 'Medal'])
fig.update_layout(title_x=0.5)
fig.show()